Abstract

Given a set S of n points in d-dimensional Euclidean metric space X and a small positive real
number $\epsilon$, we present an algorithm to preprocess S and answer queries that require finding a set
$S' \subseteq S$ of $\epsilon$-approximate nearest neighbors (ANNs) to a given query point $q \in X$. The points
belonging to set $S'$ have the following characteristics:

$\forall s \in S'$, $\exists$ a point $p \in X$ such that $|pq| < \epsilon$ and the nearest neighbor of p is s, and

$\exists$ an $s' \in S'$ such that $s'$ is a nearest neighbor of q.

During the preprocessing phase, from the Voronoi diagram of S we construct a set of box
trees of size $O(4^d \frac{V}{\delta} (\frac{\pi}{\epsilon})^{d-1})$, which facilitate querying ANNs of any input query point in
$O(\frac{1}{d} \lg \frac{V}{\delta} + (\frac{\pi}{\epsilon})^{d-1})$ time. Here $\delta$ equals $(\frac{\epsilon}{2\sqrt{d}})^d$, and V is the volume of a large bounding box
that contains all the points of set S. The average-case cardinality of $S'$ is shown to depend on S
and $\epsilon$.

1 Introduction 

Nearest neighbor searching has applications in knowledge discovery, data mining, pattern classifi- 
cation, machine learning, data compression, multimedia databases, and document retrieval. 

From the perspective of worst-case performance, an ideal solution for nearest neighbor search
would be to preprocess the points in $O(n \lg n)$ time into a data structure requiring $O(n)$ space so
that any query can be answered in $O(\lg n)$ time. In $R^1$, this is possible by sorting S and then
using binary search to answer queries. In $R^2$, this is possible by computing the Voronoi diagram
for S and then using an algorithm for planar point location to find the cell containing the query
point. However, in higher dimensions, the worst-case complexity of the Voronoi diagram grows as
$\Theta(n^{\lceil d/2 \rceil})$. In particular, no known method simultaneously achieves roughly linear space
and logarithmic query time in higher dimensional spaces. This is the primary reason for resorting
to approximation algorithms. Given a set S of n points, a point $p \in S$ is an $\epsilon$-approximate nearest
neighbor (ANN) of a point $q \in R^d$ if the distance between q and p is at most $(1 + \epsilon)$ times the Euclidean
distance between q and its nearest neighbor in S. However, even in approximation schemes, the
space, preprocessing time, and query time complexities usually grow exponentially in $\epsilon$.
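The one-dimensional case mentioned above can be sketched in a few lines. This is only an illustration of the sort-then-binary-search idea; the function names `preprocess` and `nearest_1d` are chosen here and do not come from the paper.

```python
from bisect import bisect_left

def preprocess(points):
    """Sort the 1-D point set once: O(n lg n) time, O(n) space."""
    return sorted(points)

def nearest_1d(sorted_pts, q):
    """Return the point of sorted_pts closest to q using binary search, O(lg n)."""
    i = bisect_left(sorted_pts, q)                 # first index with sorted_pts[i] >= q
    candidates = sorted_pts[max(i - 1, 0):i + 1]   # at most the two neighbours of q
    return min(candidates, key=lambda p: abs(p - q))

pts = preprocess([7.0, 1.0, 4.0, 10.0])
print(nearest_1d(pts, 5.2))  # 4.0
```

Only the two points that bracket q in sorted order can be its nearest neighbor, which is why a single binary search suffices.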

Netanyahu et al. [2] present an algorithm to find an ANN of a query point q in $O(c \lg n)$ time
(where c depends on d and $\epsilon$), with a data structure of size $O(dn)$ constructed in $O(dn \lg n)$ time. Their data
structure, the BBD-tree, is based on a hierarchical decomposition of space: space is subdivided into a
collection of cells, each of which is either a d-dimensional rectangle or the set-theoretic difference of
two rectangles, one enclosed within the other. The BBD-tree achieves both the exponential decrease
in cardinality (like kd-trees) and the geometric reduction in size (like quadtrees) as one descends
the tree. After finding the leaf cell containing the query point q in $O(\lg n)$ time by a simple descent
through the tree, one can find q's approximate nearest neighbors: enumerate the leaf cells in
increasing order of distance from q until the distance to the point p associated with a cell exceeds
$dist(p', q)/(1 + \epsilon)$, where $p'$ is the closest point to q in the output so far. The algorithm of
Arya et al. [3] minimizes the number and complexity of cells by constructing an approximate Voronoi
diagram of S during the preprocessing phase. This yields a data structure of $O(n \gamma^{d-1} \lg \gamma)$ space
that can answer $\epsilon$-ANN queries in $O(\lg(n\gamma) + 1/(\epsilon\gamma)^{(d-1)/2})$ time for a space-query time
trade-off parameter $\gamma$, where $2 \le \gamma \le 1/\epsilon$. The principal reduction in space arises from a form of
deterministic sampling over the BBD-tree. The approach also creates cells more economically using
the well-separated pair decomposition (WSPD) of the points, by generating them along the bisector
between well-separated pairs. In the case $\gamma = 1/\epsilon$, the additional $\lg \gamma$ factor in the space
complexity can be avoided, giving a data structure that answers queries in $O(\lg(n/\epsilon))$ time
with $O(n/\epsilon^{d-1})$ space (refer Arya et al. [1] and Chan [1]). Har-Peled [5] and Sabharwal et
al. [6] suggest an approach to reduce the query version of the ANN problem to point location in balls.

Figure 1: Definition of $\epsilon$-approximate nearest neighbors

In this paper, we are interested in querying for $\epsilon$-ANNs, but our definition of $\epsilon$-ANN differs
slightly from the one in the literature. To our knowledge, this appears to be the first paper to give an
algorithm for the following variant definition of $\epsilon$-ANN. For a query point q, the output of our algorithm
is a set $S' \subseteq S$ wherein: $\forall s \in S'$, $\exists$ a point p in $R^d$ such that $|pq| < \epsilon$ and the nearest neighbor
of p is s, and $\exists$ an $s' \in S'$ such that $s'$ is a nearest neighbor of q.

As shown in Fig. 1, let $\{s_1, s_2, \ldots, s_9\}$ be the set S. Consider a query point q that belongs to
the Voronoi region of $s_9$. Suppose $S' = \{s_5, s_9\}$. The site $s_5$ is in $S'$ as there exists a point $p_5$ such
that $|qp_5| < \epsilon$ and $s_5$ is the nearest neighbor of $p_5$. Similarly, the site $s_9$ is in $S'$ because $|qq| < \epsilon$
and $s_9$ is the nearest neighbor of q. Consistent with the definition of $\epsilon$-ANN, not all sites that
correspond to regions intersecting the $\epsilon$-radius ball centered at q are in $S'$ (for example, $s_4 \notin S'$).
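The variant definition can be explored with a brute-force reference that samples candidate points p on a grid inside the $\epsilon$-ball around q. This is a hedged sketch, not the paper's algorithm: the grid resolution `steps` and the function names are assumptions made for illustration, and the result is only an under-approximation of the largest admissible $S'$, since a coarse grid may miss thin Voronoi regions.

```python
import math
from itertools import product

def nearest_site(p, sites):
    """Index of the site closest to p (brute force)."""
    return min(range(len(sites)), key=lambda i: math.dist(p, sites[i]))

def variant_ann_sampled(q, sites, eps, steps=20):
    """Collect sites that are the nearest neighbour of some grid point p with |pq| < eps.

    Every returned site satisfies the first condition of the definition, and
    p = q is always sampled, so the nearest neighbour of q is always included.
    """
    d = len(q)
    offsets = [eps * k / steps for k in range(-steps, steps + 1)]
    result = set()
    for off in product(offsets, repeat=d):
        if math.hypot(*off) < eps:                      # keep only points inside the ball
            p = tuple(qi + oi for qi, oi in zip(q, off))
            result.add(nearest_site(p, sites))
    return result

sites = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
print(variant_ann_sampled((4.9, 0.0), sites, 0.5))   # {0, 1}
```

With $\epsilon = 0.5$ the ball around q crosses the bisector x = 5, so site 1 is reported alongside the exact nearest neighbor, site 0.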

We first construct the Voronoi diagram VD of S. We then partition the space occupied by VD
(contained in a large bounding hyperbox) into smaller hyperboxes and organize these hyperboxes into a box
tree data structure, termed the main box tree. We also cover specific regions of VD with hyperboxes
whose bounding planes are not parallel to the coordinate planes and construct a set of corresponding
box tree data structures, each of which is termed an auxiliary box tree. The advantage is that
the query phase does not require the data structures corresponding to VD; it requires only the main
and auxiliary box tree data structures. Although our algorithm does not improve the time or space
bounds of previous algorithms, it seems to provide another direction (detailed in Section 5)
for approaching the query version of the ANN problem. This is the main motivation for presenting this
result.

2 Algorithm 

Our algorithm first constructs the Voronoi diagram VD corresponding to the input set S of points
in d-dimensional Euclidean metric space. Before deleting the data structures corresponding to VD, we
construct a set of box trees from VD. We enclose a sufficiently large region of VD in a bounding
hyperbox BB so that every site in S is an ANN of any point exterior to BB. We spatially
partition BB into hyperboxes of volume $\delta$ (a real number that depends on $\epsilon$), and with each such
hyperbox HB we associate the sites corresponding to the Voronoi cells that HB intersects. We
organize these hyperboxes into a box tree, termed the main box tree (refer Fig. 2). Whenever
the hyperbox HB stored at a leaf of the main box tree is associated with multiple sites, we cover
the region enclosed by HB with multiple hyperboxes, termed auxiliary hyperboxes. Every such
hyperbox is obtained by rotating HB about its center by a distinct discrete angle. By partitioning
the region enclosed by each of these auxiliary hyperboxes, we construct a box tree, termed an
auxiliary box tree (refer Fig. 2). As in the main box tree, each leaf node l of every auxiliary box tree
is associated with the Voronoi sites corresponding to the Voronoi cells that intersect the hyperbox at l.

Figure 2: Data structures at the end of preprocessing phase

Each node of a main or auxiliary box tree is implicitly associated with a hyperbox; for an internal
node, this is the union of the hyperboxes associated with its children. Also, every internal node
v has the information necessary to locate q in the hyperbox associated with a specific child of v. Given
a query point q for which the user is interested in finding ANNs, we descend through the
main box tree to find the leaf node l whose hyperbox HB contains q. If HB is associated
with one and only one Voronoi site, we have determined the exact nearest neighbor of q. Otherwise, similar
to traversing the main box tree, we traverse each of the auxiliary box trees associated with leaf node
l so as to limit the cardinality of the set $S'$ output by the algorithm, as detailed below. Essentially,
during the query phase we rely on the main and auxiliary box trees rather than on data structures that
store the Voronoi structure.

2.1 Preprocessing phase 

The preprocessing phase starts by constructing a Voronoi diagram. 

The box tree data structure is a straightforward generalization of the quadtree data structure to higher
dimensions. It is a rooted tree in which every internal node has $2^d$ children. The root node of
the main box tree corresponds to a hyperbox that is the union of all the hyperboxes into which
BB is partitioned. The children of any internal node v are obtained by subdividing the hyperbox HB
corresponding to v with the hyperplanes parallel to the coordinate planes that bisect each dimension
of HB. In other words, the hyperbox of every internal node v is partitioned into $2^d$ hyperboxes, and each
of them is implicitly stored at a distinct child of v. Note that since the box is symmetric the direction of
rotation is not important. Each leaf node l of the main box tree stores, as satellite data, the sites
corresponding to the Voronoi cells that intersect the hyperbox at l. The half-space (defined by a
hyperplane) to which a point of S belongs is decided by a trivial test. This elementary operation assists
in associating sites with the leaf nodes.

The recursive definition of the box tree immediately translates into a recursive algorithm:
partition the hyperbox associated with a node v into $2^d$ hyperboxes using hyperplanes parallel to
the coordinate planes. The recursion stops at a leaf node l of the main box tree whenever the volume
of the hyperbox HB at l is less than a real number $\delta$ (which is defined in terms of $\epsilon$).
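The recursive subdivision can be sketched as follows. This is a minimal illustration under stated assumptions: boxes are stored as `(lo, hi)` corner lists, nodes are plain dictionaries, and the association of Voronoi sites with leaves is omitted because it requires the Voronoi diagram itself. None of these names come from the paper.

```python
from itertools import product
from math import prod

def subdivide(lo, hi):
    """Split the box [lo, hi] into 2^d children by bisecting every dimension."""
    mid = [(a + b) / 2 for a, b in zip(lo, hi)]
    children = []
    for bits in product((0, 1), repeat=len(lo)):
        c_lo = [m if b else a for a, m, b in zip(lo, mid, bits)]
        c_hi = [h if b else m for m, h, b in zip(mid, hi, bits)]
        children.append((c_lo, c_hi))
    return children

def build(lo, hi, delta, depth=0):
    """Recursively build a box tree; a node becomes a leaf once its volume < delta."""
    volume = prod(h - l for l, h in zip(lo, hi))
    node = {"lo": lo, "hi": hi, "depth": depth, "children": []}
    if volume >= delta:
        node["children"] = [build(cl, ch, delta, depth + 1)
                            for cl, ch in subdivide(lo, hi)]
    return node

def stats(node):
    """Return (number of nodes, maximum depth) of the subtree rooted at node."""
    if not node["children"]:
        return 1, node["depth"]
    results = [stats(c) for c in node["children"]]
    return 1 + sum(n for n, _ in results), max(dep for _, dep in results)

root = build([0.0, 0.0], [4.0, 4.0], delta=1.0)
print(stats(root))  # (85, 3)
```

For d = 2, V = 16 and $\delta$ = 1 the tree has depth 3 and 85 nodes, since boxes of volume exactly $\delta$ are split once more before the recursion stops.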

Consider the case in which the hyperbox HB at a leaf node l is associated with multiple sites
but the volume of HB is less than $\delta$. In this case, we associate a linked list of auxiliary box trees
with leaf node l. The root of each of these auxiliary box trees corresponds to an auxiliary hyperbox,
which is constructed by rotating HB by a solid angle $\epsilon$ about its center. However, since a query
point that belongs to HB is not necessarily contained in such an auxiliary hyperbox, we double
the size of HB before orienting it.

Figure 3: Auxiliary hyperboxes corresponding to roots of auxiliary box trees

Let $HB'$ be the hyperbox that is obtained by scaling HB by two about its center (refer Fig.
3). The root of the first auxiliary box tree in the linked list is associated with a hyperbox $HB'_1$
that is obtained by rotating $HB'$ about its center by a solid angle $\epsilon$. The next auxiliary box tree in
the linked list is obtained by rotating $HB'_1$ by $\epsilon$, and so on, so that no two of the auxiliary hyperboxes
corresponding to two distinct auxiliary box trees are oriented at the same angle. Consequently,
the length of the linked list associated with l is $(\frac{\pi}{\epsilon})^{d-1}$. Each of the auxiliary box trees is defined
similarly to the main box tree: partition the hyperbox associated with a node v into $2^d$ hyperboxes
using hyperplanes parallel to the faces of the (rotated) root hyperbox. The recursion stops at a leaf
node l whenever the volume of the hyperbox corresponding to a node is less than $\delta$.
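The effect of doubling HB before rotating it can be checked in the plane. The sketch below is restricted to d = 2 and uses hypothetical helper names; the set of orientations `k * eps` in $[0, \pi)$ is an assumption about how the distinct discrete angles might be enumerated.

```python
import math

def in_rotated_box(q, center, half_side, theta):
    """True if q lies in the square of half-width half_side rotated by theta about center."""
    dx, dy = q[0] - center[0], q[1] - center[1]
    # Rotate q by -theta so the test becomes an axis-parallel comparison.
    x = dx * math.cos(theta) + dy * math.sin(theta)
    y = -dx * math.sin(theta) + dy * math.cos(theta)
    return abs(x) <= half_side and abs(y) <= half_side

def auxiliary_angles(eps):
    """Distinct discrete orientations k * eps in [0, pi): about pi / eps of them in the plane."""
    return [k * eps for k in range(math.ceil(math.pi / eps))]

corner = (1.0, 1.0)  # a corner of HB = [-1, 1]^2
print(in_rotated_box(corner, (0.0, 0.0), 1.0, math.pi / 4))  # False
print(in_rotated_box(corner, (0.0, 0.0), 2.0, math.pi / 4))  # True
```

A corner of the unscaled square falls outside its own rotated copy, whereas the doubled square contains it for every orientation, because the corner is at distance $\sqrt{2}$ from the center and the doubled square contains the disk of radius 2.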



2.2 Query phase 

Given a query point q in d-dimensional Euclidean space, we intend to find its ANNs. First, we find
the leaf node of the main box tree whose hyperbox contains q. This is accomplished by visiting one node
at each level of the main box tree. The satellite data associated with an internal node v assists in
branching to a particular child of v. The same could be accomplished in $O(1)$ time using hashing, i.e., by
maintaining a hash table rather than a box tree. However, to maintain consistency between the
primary and secondary level data structures, we organize the primary level data structure as a box tree.
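Both ways of locating q can be sketched briefly. The code below assumes axis-parallel boxes given by corner tuples and a cubical bounding box; `locate_leaf` mimics the level-by-level descent without materializing the tree, and `grid_key` is one hypothetical way to form a hash key for the leaf cell.

```python
from math import prod, floor

def locate_leaf(q, lo, hi, delta):
    """Descend one node per level until the box containing q has volume < delta."""
    lo, hi, depth = list(lo), list(hi), 0
    while prod(h - l for l, h in zip(lo, hi)) >= delta:
        for k in range(len(q)):
            mid = (lo[k] + hi[k]) / 2
            if q[k] >= mid:
                lo[k] = mid      # q is in the upper half along dimension k
            else:
                hi[k] = mid      # q is in the lower half along dimension k
        depth += 1
    return tuple(lo), tuple(hi), depth

def grid_key(q, lo, leaf_side):
    """Hash-table alternative: integer cell coordinates of q (constant time for fixed d)."""
    return tuple(floor((qk - lk) / leaf_side) for qk, lk in zip(q, lo))

print(locate_leaf((2.6, 0.3), (0.0, 0.0), (4.0, 4.0), 1.0))  # ((2.5, 0.0), (3.0, 0.5), 3)
print(grid_key((2.6, 0.3), (0.0, 0.0), 0.5))                 # (5, 0)
```

Both calls identify the same leaf cell: the key (5, 0) with leaf side 0.5 corresponds to the box with lower corner (2.5, 0.0).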

Let q belong to a hyperbox HB associated with a leaf node l of the main box tree. If HB is
associated with a single site s, then we immediately conclude that the exact nearest neighbor of q is s.
Otherwise, we traverse the linked list L referred to by node l. As mentioned earlier, each node of the
linked list refers to an auxiliary box tree. Similar to the traversal of the main box tree, we traverse each
of the auxiliary box trees referred to by L, say those rooted at $HB'_1, HB'_2, \ldots, HB'_k$. Let
$R \subseteq \{HB, HB'_1, \ldots, HB'_k\}$ be the set of hyperboxes that contain q. Also, let $S_1, \ldots, S_{|R|}$ be the
sets of sites respectively associated with the hyperboxes in R. The algorithm returns the set
$S' = S_1 \cap S_2 \cap \ldots \cap S_{|R|}$ as the set of ANNs of q.
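The final step of the query is a plain set intersection. A minimal sketch, assuming each hyperbox in R has already been resolved to the set of site identifiers stored at the leaf that contains q (the integer identifiers below are made up for illustration):

```python
from functools import reduce

def combine_site_sets(site_sets):
    """Return the intersection of the site sets of the hyperboxes in R that contain q."""
    if not site_sets:
        raise ValueError("R always contains at least the main leaf hyperbox HB")
    return reduce(set.intersection, site_sets)

main_leaf_sites = {2, 4, 5, 7}            # sites whose cells meet HB
aux_leaf_sites = [{4, 5, 7}, {1, 5, 7}]   # sites found in two auxiliary box trees
print(combine_site_sets([main_leaf_sites] + aux_leaf_sites))  # {5, 7}
```

Each additional auxiliary box tree can only shrink the candidate set, which is how the auxiliary trees limit the cardinality of the output.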



3 Correctness 

From the analysis in Section 4, it is immediate that the algorithm always terminates in both the
preprocessing phase and the query phase. We have chosen the size of the bounding box BB
so that the algorithm is correct for all query points exterior to BB.

Let q belong to a hyperbox HB that corresponds to a leaf node l of the main box tree. If
HB is associated with a single site, then according to the definition of the Voronoi diagram, the algorithm
outputs the nearest neighbor of q. Otherwise, the proof of correctness is as follows.

We ensure that the hyperbox corresponding to the root of every auxiliary box tree contains
q by doubling the size of HB. The intersection of all the auxiliary hyperboxes associated with any
leaf node l yields an (approximate) hypersphere B whose boundary consists of hyperboxes
of size $\epsilon$ in $R^{d-1}$. Given that the volume of the hyperbox is $\delta$, the length of the main diagonal of a
hyperbox is $\sqrt{d}\,\delta^{1/d}$. Hence, the diameter of hyperball B is less than or equal to $2\sqrt{d}\,\delta^{1/d}$.

Let $S'$ be the set of sites whose Voronoi regions intersect the hyperball B. Suppose q lies
in B. In this case, though the output of the query algorithm does not contain any site from the set
$S - S'$, no auxiliary box tree eliminates any site in $S'$. By choosing the diameter of B as $\epsilon$, every site
in $S'$ is an ANN of q. Hence, we choose $\delta$ as $(\frac{\epsilon}{2\sqrt{d}})^d$. Consider the other case, in which q does not
lie in B. In this case, the intersection of the sets of sites associated with the multiple auxiliary hyperboxes
defines the ANNs of q. Since the main diagonal length of every hyperbox that is implicitly stored
at any leaf node is upper bounded by $\epsilon$, correctness is ensured again.
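The choice of $\delta$ can be written out explicitly. The derivation below assumes that a leaf hyperbox of volume $\delta$ is a hypercube of side $\delta^{1/d}$, which holds when a cubical bounding box is bisected along every dimension at each level:

```latex
\[
\mathrm{diag}(HB) = \sqrt{d}\,\delta^{1/d},
\qquad
\mathrm{diam}(B) \le 2\sqrt{d}\,\delta^{1/d} = \epsilon
\;\Longrightarrow\;
\delta = \left(\frac{\epsilon}{2\sqrt{d}}\right)^{d}.
\]
```

For instance, with d = 2 and $\epsilon = 0.1$ this gives $\delta = (0.1 / (2\sqrt{2}))^2 = 0.00125$.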

4 Analysis 

Theorem 4.1 Assuming a uniform distribution of S in $R^d$ Euclidean metric space, the expected
cardinality of set $S'$ is $(\frac{\epsilon}{2r})^d |S|$. Here, r is the radius of the smallest hyperball that contains S.

Proof: First note that a hyperball of radius r has volume $C_d r^d$, where $C_d = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)}$. Due to the
uniform distribution of S, the expected number of sites per unit volume is $\frac{|S|}{C_d r^d}$. Since the diameter
of the hyperball B is $\epsilon$, the expected number of sites in B is $(C_d (\frac{\epsilon}{2})^d)(\frac{|S|}{C_d r^d}) = (\frac{\epsilon}{2r})^d |S|$. □
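The expectation in Theorem 4.1 is easy to evaluate numerically. The sketch below follows the proof step by step; the function names are illustrative, and the sample values in the usage line are invented.

```python
import math

def unit_ball_coeff(d):
    """C_d = pi^(d/2) / Gamma(d/2 + 1), the volume of the unit ball in R^d."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

def expected_cardinality(n_sites, eps, r, d):
    """Expected number of uniformly distributed sites inside a ball of diameter eps."""
    density = n_sites / (unit_ball_coeff(d) * r ** d)    # sites per unit volume
    ball_volume = unit_ball_coeff(d) * (eps / 2) ** d    # volume of B
    return ball_volume * density                         # equals (eps / (2 r))^d * n_sites

print(round(expected_cardinality(10**6, 0.2, 1.0, 3), 6))  # 1000.0
```

The coefficient $C_d$ cancels, so the result depends only on the ratio $\epsilon / (2r)$, the dimension, and $|S|$.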

Lemma 4.1 The depth of the main box tree is at most $\frac{1}{d} \lg \frac{V}{\delta} + 1$, where V is the volume of the
initial bounding box and $\delta$ is the volume tolerance of the smallest possible box.

Proof: The box volume goes down by a factor of $2^d$ at every level; therefore the volume of the box at
depth i is $\frac{V}{2^{di}}$. We know that the volume $\delta$ is possible at one less than the maximum depth.
Therefore, the smallest possible box volume is $\frac{\delta}{2^d}$. Hence, for any hyperbox at depth i in the main box tree,
$\frac{V}{2^{di}} \ge \frac{\delta}{2^d} \Rightarrow i \le \frac{1}{d} \lg \frac{V}{\delta} + 1$. □

Lemma 4.2 The maximum possible number of nodes in the main box tree is $\frac{4^d \frac{V}{\delta} - 1}{2^d - 1}$.

Proof: If we consider the complete box tree, the number of boxes in the box tree at the ith level
is $2^{di}$. Therefore, the total number of nodes in the main box tree is $1 + 2^d + 2^{2d} + \ldots$ ($\frac{1}{d} \lg \frac{V}{\delta} + 2$
terms); summing this geometric progression yields the result. □
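The geometric sum in the proof can be spelled out as follows, writing k for the depth bound of Lemma 4.1 (the symbol k is introduced here only for this derivation):

```latex
\[
\sum_{i=0}^{k} 2^{di} = \frac{2^{d(k+1)} - 1}{2^{d} - 1},
\qquad
k = \frac{1}{d}\lg\frac{V}{\delta} + 1
\;\Longrightarrow\;
2^{d(k+1)} = 2^{\lg(V/\delta) + 2d} = 4^{d}\,\frac{V}{\delta},
\]
\[
\text{so the number of nodes is at most } \frac{4^{d}\,\frac{V}{\delta} - 1}{2^{d} - 1}.
\]
```

As a check, d = 2 and $V/\delta = 16$ give $(16 \cdot 16 - 1)/3 = 85 = 1 + 4 + 16 + 64$ nodes.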

Lemma 4.3 The maximum depth of any auxiliary box tree is $O(1)$.

Proof: We know that the volume of any hyperbox HB at a leaf node of the main box tree is $\delta$.
Since we scale HB by two, the volume of the resultant hyperbox at the root of an auxiliary box tree is
$2\delta$. Then from Lemma 4.1 the depth of the auxiliary box tree is at most $\frac{1}{d} \lg \frac{2\delta}{\delta} + 1$, which is $O(1)$. □
Figure 4: Suggested approach. Arrows indicate the direction of expansion of hyperboxes into a Voronoi
cell so that they together cover that cell.

Lemma 4.4 The maximum number of nodes in any auxiliary box tree is $\frac{2 \cdot 4^d - 1}{2^d - 1}$.

Proof: This is immediate by substituting $V = 2\delta$ in the statement of Lemma 4.2. □

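Theorem 4.2 Given the Voronoi diagram of S, the space complexity in constructing all box trees
together is $O(4^d \frac{V}{\delta} (\frac{\pi}{\epsilon})^{d-1})$.

Proof: The main box tree is a complete tree, and by Lemma 4.1 the number of leaf nodes is at most
$2^d \frac{V}{\delta}$. Also, each leaf may have a list of length $(\frac{\pi}{\epsilon})^{d-1}$. From Lemma 4.4, each auxiliary box tree
has $O(2^d)$ nodes, so the space complexity is as stated. □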

Theorem 4.3 The preprocessing time in constructing all the box trees is 0{4:'^^{j)^~^+{dn){^)'^~^-\ 

Proof: Since only constant time is spent in creating each node of every box tree, the time
complexity is the same as the space complexity, except that we need to consider the following. The cost
of associating sites with the leaves of the main box tree involves dn point location searches at each level
of the main box tree. The same is applicable to the auxiliary box trees. Including the time complexity of
constructing the Voronoi diagram of S yields the stated bound. □



Theorem 4.4 The query time complexity is $O(\frac{1}{d} \lg \frac{V}{\delta} + (\frac{\pi}{\epsilon})^{d-1})$.

Proof: From Lemma 4.1, the worst-case depth of the main box tree is $\frac{1}{d} \lg \frac{V}{\delta} + 1$. The worst-case
length of any list at a leaf node of the main box tree is $(\frac{\pi}{\epsilon})^{d-1}$. From Lemma 4.3, the
worst-case height of an auxiliary box tree is $O(1)$. Therefore, the worst-case query time is
$\frac{1}{d} \lg \frac{V}{\delta} + 1 + (\frac{\pi}{\epsilon})^{d-1} O(1)$. □

Note that each occurrence of $\delta$ in the above analysis can be expressed in terms of $\epsilon$, as $\delta = (\frac{\epsilon}{2\sqrt{d}})^d$.
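To get a feel for how these bounds behave, they can be evaluated for concrete parameters. The sketch below is illustrative only: it assumes $\delta = (\epsilon / (2\sqrt{d}))^d$ and a list length of $(\pi/\epsilon)^{d-1}$, drops the constants hidden by the O-notation, and uses function names that are not from the paper.

```python
import math

def delta_from_eps(eps, d):
    """delta = (eps / (2 sqrt(d)))^d, the leaf volume tolerance."""
    return (eps / (2 * math.sqrt(d))) ** d

def bounds(V, eps, d):
    """Evaluate the stated bounds, ignoring the constants hidden by O(.)."""
    delta = delta_from_eps(eps, d)
    main_depth = math.log2(V / delta) / d + 1          # Lemma 4.1
    list_length = (math.pi / eps) ** (d - 1)           # auxiliary box trees per leaf
    return {
        "delta": delta,
        "main_depth": main_depth,
        "list_length": list_length,
        "space": 4 ** d * (V / delta) * list_length,   # Theorem 4.2
        "query": main_depth + list_length,             # Theorem 4.4
    }

b = bounds(V=1.0, eps=0.1, d=2)
print(b["delta"], b["list_length"])
```

Even for small d the list-length term dominates the query cost as $\epsilon$ shrinks, which is consistent with the remark that the bounds do not improve on earlier algorithms.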

5 Conclusions 

This paper proposed an algorithm that builds data structures of size $O(4^d \frac{V}{\delta} (\frac{\pi}{\epsilon})^{d-1})$ during the
preprocessing phase, which facilitate querying ANNs of an input query point in $O(\frac{1}{d} \lg \frac{V}{\delta} + (\frac{\pi}{\epsilon})^{d-1})$
time. Here $\delta$ equals $(\frac{\epsilon}{2\sqrt{d}})^d$, and V is the volume of the large bounding box that contains the set S of
points. The expected cardinality of the output set $S'$ corresponding to a given query point is shown to
depend on S and $\epsilon$. Although this algorithm does not improve either the space or the query-time complexity
of existing algorithms, it suggests an approach to explore the problem under the following setting:
Independently cover each Voronoi cell with O(^) hyperboxes (refer Fig. 4); this leads to a space
complexity of O(^) (given that there are at most n Voronoi cells). Organize all the hyperboxes
covering the entire Voronoi diagram into a data structure, similar to the one proposed herewith, to 
achieve query time efficiency. 